@etraut-openai
Collaborator

Idle Codex CLI instances can get stuck after another concurrently-running instance refreshes and rotates the shared ChatGPT refresh token: the idle process wakes up, gets a 401, and its in-memory refresh token is no longer valid, so refresh fails permanently.

This change makes 401 recovery resilient to concurrent token rotation by first syncing ChatGPT tokens from the configured credential store (file/keyring/auto) and retrying the request, then performing a network refresh only if needed (using the refresh token loaded from storage). It also prevents accidental cross-account/workspace switching by only adopting/refreshing when chatgpt_account_id matches the request’s auth snapshot, and adds bounded retries on transient auth.json parse errors to handle concurrent truncate+write. Added unit tests for the storage-sync outcomes.

This addresses #6498, which several users have reported.
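A minimal sketch of the recovery order described above, not the PR's actual code. The `Tokens` struct, the `Recovery` enum, and `plan_recovery` are hypothetical names introduced for illustration, and the two-step "sync, retry, then refresh" flow is collapsed into a single decision for brevity.

```rust
// Hypothetical types for illustration only; the real auth types differ.
#[derive(Debug, Clone, PartialEq)]
struct Tokens {
    account_id: String,
    access_token: String,
    refresh_token: String,
}

#[derive(Debug, PartialEq)]
enum Recovery {
    /// Storage holds different tokens for the same account: adopt them and retry.
    RetryWithStored(Tokens),
    /// Storage matches memory: only a network refresh can help, using the stored refresh token.
    NetworkRefresh { refresh_token: String },
    /// Storage is empty or belongs to another account: do not touch the in-flight request.
    GiveUp,
}

/// Decide how to react to a 401, given the tokens the request was sent with
/// and whatever the credential store currently holds.
fn plan_recovery(in_memory: &Tokens, stored: Option<&Tokens>) -> Recovery {
    match stored {
        // Never adopt or refresh tokens that belong to a different account.
        Some(s) if s.account_id != in_memory.account_id => Recovery::GiveUp,
        // Another instance rotated the tokens: pick them up without a network call.
        Some(s) if s != in_memory => Recovery::RetryWithStored(s.clone()),
        // Nothing newer on disk: refresh over the network.
        Some(s) => Recovery::NetworkRefresh {
            refresh_token: s.refresh_token.clone(),
        },
        None => Recovery::GiveUp,
    }
}
```

Under these assumptions, an idle instance whose stored tokens were rotated by a sibling process gets `RetryWithStored`, so it never presents the stale refresh token to the server.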

@etraut-openai
Collaborator Author

@codex review

Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@pakrym-oai
Collaborator

> It also prevents accidental cross-account/workspace switching by only adopting/refreshing when chatgpt_account_id matches the request’s auth snapshot

Why is this required?

.await
.map_err(RefreshTokenError::Transient)?
else {
return Ok(None);
Collaborator

Should a method be extracted here that returns an Option, so you can use ? to short-circuit all these checks and drop the repeated return Ok(None); statements?
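A rough sketch of the extraction suggested here, assuming simplified stand-in types. `StoredAuth`, `StoredTokens`, and `matching_refresh_token` are hypothetical names, not the ones in the PR. Each precondition becomes a `?` on an `Option`, so the helper has a single exit expression instead of several early returns.

```rust
// Hypothetical, simplified stand-ins for the stored auth data.
#[derive(Debug, Clone)]
struct StoredAuth {
    tokens: Option<StoredTokens>,
}

#[derive(Debug, Clone)]
struct StoredTokens {
    account_id: Option<String>,
    refresh_token: String,
}

/// Returns the stored refresh token only when storage is present, has tokens,
/// carries an account id, and that id matches the one the request used.
fn matching_refresh_token(
    stored: Option<&StoredAuth>,
    expected_account_id: &str,
) -> Option<String> {
    // Each `?` short-circuits to `None` when a precondition is missing.
    let tokens = stored?.tokens.as_ref()?;
    let account_id = tokens.account_id.as_deref()?;
    // `bool::then` yields `Some(..)` only when the identity check passes.
    (account_id == expected_account_id).then(|| tokens.refresh_token.clone())
}
```

The caller can then turn the `Option` into `Ok(None)` or `Ok(Some(..))` once, at the boundary, rather than at every check.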

auth: &Option<crate::auth::CodexAuth>,
) -> Result<()> {
if *refreshed {
if recovery.refreshed_token {
Collaborator

can we keep the refresh logic fully inside AuthManager so no external checking is needed? We can use some status endpoint to check whether the token is alive.

Will avoid every client having to maintain a complex recovery loop.

Collaborator Author

Clients already have a recovery loop. Implementing another recovery loop in the AuthManager seems a little redundant, but I agree that we can move more of the auth-specific recovery logic into AuthManager so it doesn't need to be repeated by each client.

Collaborator

Yes, I'm mostly worried about the fact that every client sending requests using token auth will need to reproduce this logic.

Collaborator

@pakrym-oai pakrym-oai left a comment

Is there a way we can make both the refresh logic and the consumption logic simpler?

@etraut-openai
Collaborator Author

@pakrym-oai, I updated AGENTS.md to reflect your feedback about "too many return code paths in one function". I'm trying to get into the habit of reflecting code review feedback in AGENTS.md so we can reduce the need for back-and-forth code review changes in the future. Let me know if you think that the instructions don't capture what you're looking for in terms of code style.

self.auth().map(|a| a.mode)
}

pub(crate) async fn sync_from_storage_for_request(
Collaborator

nit: remove pub

Ok(UnauthorizedRecoveryDecision::Retry)
}
SyncFromStorageResult::SkippedMissingIdentity => {
Ok(UnauthorizedRecoveryDecision::Retry)
Collaborator

Why are we retrying on missing identity? Isn't it fatal?

Ok(SyncFromStorageResult::Applied { changed })
}

pub(crate) async fn refresh_token_for_request(
Collaborator

nit: remove pub

};

let storage =
create_auth_storage(self.codex_home.clone(), self.auth_credentials_store_mode);
Collaborator

@pakrym-oai pakrym-oai Jan 6, 2026

should we use load_auth logic here and compare CodexAuth instances directly?

then we can use CodexAuth.refresh_token and avoid having another place where we update tokens

return Ok(SyncFromStorageResult::IdentityMismatch);
}

let changed = if let Some(current) = self.auth() {
Collaborator

Can we share this entire method's logic with reload()? It seems very similar except for the extra identity check.

// Another instance may have refreshed and rotated the refresh token while we
// were attempting our refresh. Reload and retry once if the stored refresh
// token differs and identity still matches.
let Some(stored_refresh_token) = load_stored_refresh_token_if_identity_matches(
Collaborator

can we reuse sync_from_storage_for_request here?

so we reload the entire auth object if possible and then call refresh token on it if needed?

Ok(Some(tokens.refresh_token))
}

async fn load_auth_dot_json_with_retries(
Collaborator

Should the default storage implementation do retries?
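For context, bounded retries on a transient parse error could look roughly like the sketch below. It is synchronous and uses only the standard library, whereas the function in the diff is async. `LoadError` and `load_with_retries` are hypothetical names, and how a transient parse error is distinguished from a fatal one is an assumption here.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical error type: the real code classifies errors differently.
#[derive(Debug, PartialEq)]
enum LoadError {
    /// The file was read mid-rewrite (e.g. truncated), so parsing failed; worth retrying.
    TransientParse,
    /// Any other failure; retrying will not help.
    Fatal(String),
}

/// Calls `load` until it succeeds, fails fatally, or `max_attempts` is reached.
/// At least one attempt is always made, even if `max_attempts` is 0.
fn load_with_retries<T>(
    max_attempts: u32,
    delay: Duration,
    mut load: impl FnMut() -> Result<T, LoadError>,
) -> Result<T, LoadError> {
    let mut attempt = 1;
    loop {
        match load() {
            // Only transient parse failures are retried, and only while attempts remain.
            Err(LoadError::TransientParse) if attempt < max_attempts => {
                attempt += 1;
                sleep(delay);
            }
            // Success, a fatal error, or the last transient error is returned as-is.
            other => return other,
        }
    }
}
```

Placing a wrapper like this inside the storage implementation, as the question suggests, would give every caller the same tolerance for a concurrent truncate+write without each one reimplementing the loop.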

};

if stored_account_id != expected_account_id {
// Keep cached auth in sync for subsequent requests, but do not retry the in-flight
Collaborator

I don't understand this. If we refresh the cached auth, the next request will pick it up. The client pulls auth() for retries:

let auth = auth_manager.as_ref().and_then(|m| m.auth());
let api_provider = self
